Introduction to Working with Spatial Data

Prepared & taught by Andrew Fair, for NYULH Data Day-to-Day (Feb 15, 2022)
Associate Director of Research Environment, Dept of Population Health
My background is in public health & informatics. I've taken a few GIS & web mapping classes.
andrew.fair@nyulangone.org

  1. What is GIS?
  2. What is a Spatial or Coordinate Reference System?
  3. What is spatial data? Explanation of common formats
  4. What is geocoding?
  5. How to perform an attribute join between spatial data and tabular data
  6. How to perform a spatial join between points and polygons
  7. How to create a choropleth map

What is GIS?

A Geographic Information System (GIS) is a computer system that analyzes and displays geographically referenced information. It uses data that is attached to a unique location.
https://www.usgs.gov/faqs/what-geographic-information-system-gis

Some examples of how GIS can be used in public health:

jhu_covid_map.png
https://coronavirus.jhu.edu/map.html

Myth 1: GIS = Mapping

GIS is much more than mapping! Most of our lesson today won't involve maps.
But let's be honest, maps are fun!

Myth 2: GIS requires geocoding

It's true that GIS generally entails geographically referenced data. But there is a lot you can do without ever running a geocoding program, much of which doesn't involve addresses or latitude/longitude points at all.

Myth 3: GIS = ArcGIS

ArcGIS is a product sold by a company (ESRI).
This isn't like Kleenex and tissues or Google and, well, "google."
There are lots of other software options for GIS, many of which are free and open source.
If you want a GUI like ArcGIS, QGIS is one such option.
Today, we'll be using free, open-source code-based examples using Python.

top

What is a Spatial or Coordinate Reference System?

Map projections try to portray the surface of the earth, or a portion of the earth, on a flat piece of paper or computer screen. In layman’s term, map projections try to transform the earth from its spherical shape (3D) to a planar shape (2D).

A coordinate reference system (CRS) then defines how the two-dimensional, projected map in your GIS relates to real places on the earth. The decision of which map projection and CRS to use depends on the regional extent of the area you want to work in, on the analysis you want to do, and often on the availability of data.

Key points:

https://docs.qgis.org/3.16/en/docs/gentle_gis_introduction/coordinate_reference_systems.html

top

What is spatial data? Explanation of common formats

Shapefile

The shapefile format is a geospatial vector data format for geographic information system (GIS) software. It is developed and regulated by Esri as a mostly open specification for data interoperability among Esri and other GIS software products.

The shapefile format can spatially describe vector features: points, lines, and polygons.

https://en.wikipedia.org/wiki/Shapefile

shapefile_image.png

GeoJSON

GeoJSON is an open standard format designed for representing simple geographical features, along with their non-spatial attributes. It is based on the JSON format.

The features include points (therefore addresses and locations), line strings (therefore streets, highways and boundaries), polygons (countries, provinces, tracts of land), and multi-part collections of these types. GeoJSON features need not represent entities of the physical world only; mobile routing and navigation apps, for example, might describe their service coverage using GeoJSON.

The GeoJSON format differs from other GIS standards in that it was written and is maintained not by a formal standards organization, but by an Internet working group of developers.

https://en.wikipedia.org/wiki/GeoJSON

{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "geometry": {
        "type": "Point",
        "coordinates": [102.0, 0.5]
      },
      "properties": {
        "prop0": "value0"
      }
    },
    {
      "type": "Feature",
      "geometry": {
        "type": "LineString",
        "coordinates": [
          [102.0, 0.0], [103.0, 1.0], [104.0, 0.0], [105.0, 1.0]
        ]
      },
      "properties": {
        "prop0": "value0",
        "prop1": 0.0
      }
    },
    {
      "type": "Feature",
      "geometry": {
        "type": "Polygon",
        "coordinates": [
          [
            [100.0, 0.0], [101.0, 0.0], [101.0, 1.0],
            [100.0, 1.0], [100.0, 0.0]
          ]
        ]
      },
      "properties": {
        "prop0": "value0",
        "prop1": { "this": "that" }
      }
    }
  ]
}

TopoJSON

TopoJSON is like GeoJSON, but files are generally smaller. Shared boundaries don't require duplicate representations in the schema.

top

What is geocoding?

Geocoding is the process of transforming a description of a location—such as a pair of coordinates, an address, or a name of a place—to a location on the earth's surface.

https://desktop.arcgis.com/en/arcmap/latest/manage-data/geocoding/what-is-geocoding.htm

Example using Geocodio
https://www.geocod.io/docs/?python#single-address

top

Let's pretend we want to look at the neighborhoods where NYC's public hospitals are located, and see how they compare to the neighborhoods with private hospitals. Are there differences in average neighborhood income?

We can answer this question using only open data and open-source tools

How to perform an attribute join between spatial data and tabular data

NYC Selected Facilities file has lots of useful info, including data on all the hospitals in NYC:

https://www1.nyc.gov/site/planning/data-maps/open-data/dwn-selfac.page

There are all sorts of facilities listed, but let's subset the file to only hospital facilities. We don't care about the rest.

This list would have looked pretty different just six years ago!

Lots of resources to help you work with the Census/ACS API:

We want to pull 2019 ACS 5-year data on income, at the Census Tract level

Set up API parameters:

jump back down

I don't want to have to type 'B19013_001E' each time I call this column

Quick descriptive statistics. Something looks weird...

https://www.census.gov/data/developers/data-sets/acs-1year/notes-on-acs-estimate-and-annotation-values.html
-666666666: The estimate could not be computed because there were an insufficient number of sample observations.
We'll replace it with NaN

We cannot use 'tract' alone as a unique key. It's not unique across counties. We can combine it with 'county' for a unique key.

This is a really good resource on Census GEOIDS:
https://www.census.gov/programs-surveys/geography/guidance/geo-identifiers.html

Facilities file gives borough code but not county FIPS code. It's easy to re-map.
Borough codes are here: https://en.wikipedia.org/wiki/Borough,_Block_and_Lot
And I can see the county FIPS code correspondences in the data table above.

Our first join. This will give us the hospitals merged with the Census Tract-level income data.

More columns than we need, so let's subset.

I did have some null values. I'm just going to ignore that. This probably woudn't be a great analysis IRL.

Split the dataset based on whether they're H+H hospitals or not.

Not surprisingly, average income is higher in the neighborhoods (defined by Census Tract) where private hospitals are located.

top

How to perform a spatial join between points and polygons

I can mostly reuse all the API parameters I established already:
jump up
Except I want to fetch Block Groups this time.

Now I'm going to more or less repeat a bunch of steps from above, to get Block Groups.

But! My facilities file didn't have Block Groups as an attribute. So I need to do a spatial join this time (fancy!).
Geopandas makes it easy.
I'm going to time some of these steps.

Download shapefiles from here:

https://www.census.gov/cgi-bin/geo/shapefiles/index.php
Select year: 2010, layer type: Block Groups, submit
Select state: New York, download state file

Geopandas knows how to read a shapefile

Check the coordinate reference system:

https://epsg.io/4269
United States (USA) - New York., accuracy 1.0 m

Look how simple it is to plot! It ain't pretty, but it's easy.

Now I have all the Census Block Groups in NYS, and I want to do a spatial join with the hospital facilities.
To do this, I first need to create a GeoDataFrame from the facilities using the latitude/longitude.

This is the syntax to create a GeoDataFrame using Geopandas.
It's important to set the CRS to the same as was used for the Census Block Groups:

Plotting these 61 points is much quicker.
You can just barely make out the shape of NYC.

Double check the CRS's match

Now, everything is set up for the spatial join.
I'm doing a left join onto the hospital points, so I will lop off all the extraneous block groups and be left with only 61.

But wait! I still don't have the income data attached.
I fetched that from the Census API a while back, but haven't yet attached it. Time to do that.

This time, I don't need to worry about converting borough codes to county codes. The shapefile had the county codes included.

Too many columns.

Again, nulls. Again, ignored.

Well, at least the null values seem to be roughly nondifferentially distributed among public & private hospitals.

Again, private hospitals seem to be in wealthier neighborhoods (by income) than private hospitals. And both are a bit higher when we zero into Block Groups compared to Census Tracts.

If you were really doing this analysis, you'd want to pay attention to margins of error and such.

top

How to create a choropleth map

Choro-what? Doesn't that have something to do with plants?

Interestingly, the word "choropleth" has nothing to do with color.
"Choro" comes from ancient Greek for "region" or "location," whereas "chloro" comes from ancient Greek for "green" (as in "chlorophyll").
And "pleth" is "many," as in "plethora."

But sometimes, choropleth maps are green.

610px-Australian_Census_2011_demographic_map_-_Australia_by_SLA_-_BCP_field_2715_Christianity_Anglican_Persons.svg.png

People who identify as Anglican as a fraction of total persons, in Australia, according to the 2011 census results.

By Toby Hudson based on data from the Australian Bureau of Statistics, CC BY-SA 3.0 au, Link

Creating a beautiful choropleth map must take along time, right?

Doesn't have to.

I'm going to re-do my merge, this time adding income data onto all the CBG's in NYS (not only those with hospitals in NYC).

Plotly makes really nice interactive graphics

Set up some layout parameters

Here's the choropleth map. That's it!

Let's add the hospitals.
First, let's create a new variable to bifurcate public & private.

Set a tasty color scale

And add the points to the plot

top

My sincere hope is you come away from this session with the feeling that this stuff is fun.